Abstract

This paper proposes a simple yet effective method for learning a
hierarchical object shape model composed of local contour fragments,
which represents a category of shapes in the form of an And-Or tree.
The model extends traditional hierarchical tree structures by
introducing "switch" variables (i.e., the or-nodes) that explicitly
specify production rules to capture shape variations. We thus define
the model with three layers: the leaf-nodes for detecting local contour
fragments, the or-nodes specifying the selection of leaf-nodes, and the
root-node encoding the holistic distortion. In the training stage, to
optimize the And-Or tree learning, we extend the concave-convex
procedure (CCCP) by embedding structural clustering into the iterative
learning steps. The inference for shape detection is consistent with
the model optimization: it integrates local tests via the leaf-nodes
and or-nodes with global verification via the root-node. The advantages
of our approach are validated on challenging shape databases (i.e.,
ETHZ and INRIA Horse) and are summarized as follows. (1) The proposed
method accurately localizes shape contours despite unreliable edge
detection and edge tracing. (2) The And-Or tree model captures the
intraclass variance well.

1. Introduction 

Detecting and localizing object shapes in images is an area of active
research. This paper studies a novel shape detection method that learns
a contour-fragment-based shape model. We represent a category of shapes
in the form of a hierarchical And-Or tree, which can be automatically
learned in a semi-supervised manner. Fig. 1 shows an example of a
learned shape model for horses, consisting of three layers. The bottom
layer of the tree includes a batch of leaf-nodes, i.e., the local
classifiers used for localizing contour fragments. The middle layer
contains a set of or-nodes, each of which explicitly represents a part
of the shape and specifies a few leaf-nodes for selection; intuitively,
an or-node can be viewed as a "switch" variable that activates only one
leaf-node at a time during inference. The top root-node (i.e., the
and-node) is a global classifier encoding the holistic variance and
distortion.

Figure 1. [...] tree. It comprises three layers: the leaf-nodes for
detecting local contour fragments, the or-nodes specifying selection of
leaf-nodes, and the root-node encoding the holistic variance. The bold
red vertical lines represent the selection of leaf-nodes in the
inference.
Literature review. We review the related work in two aspects: shape
(or contour) matching and shape model learning.

(i) Many methods pose shape detection as a task of matching contours
in images, and they typically use hand-drawn shapes as reference
templates [3, 6, 26, 10, 13, 19, 15]. To overcome the difficulties of
occlusions (i.e., missing true object contours) and incomplete (broken)
contours, a number of robust shape descriptors have been extensively
discussed, such as Shape Context and its extensions [3, 15, 26],
Contour Flexibility [22], and Local-angle [19, 13], as well as
effective matching schemes, e.g., particle filtering [13], dynamic
programming [6], and stochastic sampling [10]. For example, Zhu et
al. [26] proposed achieving many-to-many matching of contours with a
voting scheme. Riemenschneider et al. [19] addressed partial shape
matching by identifying matches from fragments of arbitrary length to
the reference contour.

(ii) An alternative approach to shape detection is to learn shape
models for a given category of shape instances. These methods represent
shapes as a loose collection of local contour fragments or an ensemble
of pairwise constraints [20, 2, 12, 21]. They usually construct a
codebook of fragments (e.g., PAS [8] and salient contours [20, 11]) and
train the model using boosting [20], SVM [21, 8], generative
learning [14], or Hough-style voting [17]. However, some of them are
limited to learning with clutter-free shape instances [4], and some
assume that shape configurations recur consistently, so they often
suffer from large intraclass variance (e.g., articulation) or highly
inaccurate edge detection. Recently, state-of-the-art object detection
performance was achieved by [5], where a tree-structured latent SVM
model is trained using multi-scale HoG features. This inspires us to
define a tree-structured shape model; in addition, we extend the
structure by introducing "switch" variables (i.e., the or-nodes) that
explicitly specify production rules to capture large shape variations.

The key contributions of this work are as follows. First, we propose a
shape model in the form of an And-Or tree that achieves superior
performance compared to the state-of-the-art approaches. Second, we
propose a novel optimization algorithm that learns the model structure
and parameters simultaneously in a semi-supervised way. There are four
key components in our method.

The leaf-nodes in the And-Or tree model represent a set of local
classifiers of contour fragments. According to the analysis in [15],
one of the key challenges in shape detection is that the true contours
of objects often connect to background clutter due to unreliable edge
detection and contour tracing. To address this problem, we design a
partial matching scheme that can localize the correct part of the
contour with the local classifiers.


The or-nodes in the middle of the model (see Fig. 1) are "switch"
variables specifying the production rules for leaf-node selection. Once
a number of contour fragments are detected and localized via the local
classifiers, each or-node selects one optimal contour fragment as a
part of the shape. The benefit of introducing the or-nodes is well
established [11, 12]: they provide alternative ways of composition,
which is important for handling large intraclass variance. Moreover, we
allow the or-nodes to slightly perturb their locations during
detection, which accounts for deformation and distortion. In our
implementation, we fix the or-nodes in a layout of 2x3 blocks. As
Fig. 1 illustrates, where the block of each or-node is denoted by a red
box, our model can capture not only local variants (e.g., part 2) but
also the inconsistency caused by missing or broken edges (e.g., part 3).

The training of our shape model is posed as an optimization problem of
And-Or tree learning that integrates structure learning and parameter
learning. We present a framework, namely the dynamic CCCP (dCCCP), that
extends the concave-convex procedure (CCCP) [25] by embedding a
clustering step into each iteration.

The inference for shape detection is consistent with the optimization
in training and includes two steps. We first perform bottom-up tests
using the leaf-nodes and or-nodes. A number of candidate contour
fragments are thus obtained, and some of them are activated via the
or-nodes. All the selected contour fragments are then combined via the
root-node for global verification.

The rest of this paper is organized as follows. We first present the
shape model with the And-Or tree representation in Section 2, followed
by a description of shape model learning in Section 3. The experimental
results and comparisons are presented in Section 4, and a conclusion is
given in Section 5.

2. Shape Model with And-Or Tree 

We introduce the shape model in three aspects: (i) a descriptor of
shape contours, (ii) the And-Or tree representation, and (iii)
inference with the learnt model.

2.1. Contour descriptor 

We start by designing an effective contour descriptor that 
combines the triangle-based feature proposed in [13] and 
Shape Context [3]. As Fig. 2 illustrates, this descriptor is 
suitable for a local contour fragment as well as a group of 
contour fragments representing an object shape. 

To describe a local contour fragment, we first extract a sequence of
sample points $\Omega$ from the contour fragment. For each point in
$\Omega$, we compute its triangle-based descriptor as well as its Shape
Context descriptor. By combining these two types of descriptors for
each point in $\Omega$, we obtain a discriminative and
deformation-tolerant descriptor for this contour fragment. To describe
an object shape represented by several contour fragments, the points in
$\Omega$ are selected from the whole object shape.

Figure 2. Illustration of the contour descriptor. The Shape Context
feature (in (a)) and the triangle-based feature (in (b)) are computed
for a local contour fragment. (c) These two types of features are also
suitable for a group of contour fragments representing the shape
instance.

Given a point $T \in \Omega$, the triangle-based descriptor is a
histogram of all triangles constructed by $T$ and any other two
different points $A, B \in \Omega$ (Fig. 2(b)). More precisely, it is a
3D histogram of the angle $\angle BTA$ and the two distances $TA$ and
$TB$. Note that triangle $BTA$ is oriented clockwise, and the distances
$TB$ and $TA$ are normalized by the average distance between points in
$\Omega$. The Shape Context descriptor considers the lengths and
orientations of the vectors from $T$ to all other points in $\Omega$.

In our implementation, the number of sample points in $\Omega$ is fixed
to 20, and the distances between adjacent points in $\Omega$ are equal.
We represent the 3D histogram with 2 bins for distance $TA$, 2 bins for
distance $TB$, and 6 bins for angle $\angle BTA$ ranging from 0 to
$\pi$. We then transform the 3D histogram into a 2x2x6=24-bin 1-D
feature vector. For the Shape Context descriptor, we use 2 bins for
vector distances and 6 bins for vector orientations ranging from 0 to
$2\pi$, so the 1-D feature vector of Shape Context has 2x6=12
dimensions. By combining these two descriptors into a composite
descriptor, the feature vector of the whole point sequence is the
concatenation of the composite descriptors of all points in $\Omega$,
with (24+12)x20=720 dimensions.
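The dimension count above can be checked with a minimal sketch of the composite descriptor. The bin counts and the normalization by the mean pairwise distance follow the text; the distance bin edges (nearer or farther than the mean distance), the sign convention used for "clockwise", and all function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mean_pairwise_distance(pts):
    """Average distance between all pairs of sample points (normalization scale)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return d[np.triu_indices(len(pts), k=1)].mean()

def triangle_feature(pts, t, scale):
    """2x2x6 histogram over triangles (B, T, A), flattened to 24 bins."""
    hist = np.zeros((2, 2, 6))
    others = [i for i in range(len(pts)) if i != t]
    for a in others:
        for b in others:
            va, vb = pts[a] - pts[t], pts[b] - pts[t]
            # Keep one orientation per point pair; which sign means
            # "clockwise" depends on the image axis convention (assumption).
            if vb[0] * va[1] - vb[1] * va[0] <= 0:
                continue
            na, nb = np.linalg.norm(va), np.linalg.norm(vb)
            angle = np.arccos(np.clip(va @ vb / (na * nb), -1.0, 1.0))
            k = min(int(angle / (np.pi / 6)), 5)  # 6 bins on [0, pi]
            # Assumed distance bins: nearer / farther than the mean distance.
            hist[int(na / scale > 1.0), int(nb / scale > 1.0), k] += 1
    return hist.ravel()

def shape_context_feature(pts, t, scale):
    """2x6 histogram of vector lengths and orientations, flattened to 12 bins."""
    hist = np.zeros((2, 6))
    for i in range(len(pts)):
        if i == t:
            continue
        v = pts[i] - pts[t]
        theta = np.arctan2(v[1], v[0]) % (2 * np.pi)
        k = min(int(theta / (np.pi / 3)), 5)  # 6 bins on [0, 2*pi)
        hist[int(np.linalg.norm(v) / scale > 1.0), k] += 1
    return hist.ravel()

def contour_descriptor(pts):
    """Concatenate the 24+12 bins of every sample point."""
    pts = np.asarray(pts, dtype=float)
    scale = mean_pairwise_distance(pts)
    per_point = [np.concatenate([triangle_feature(pts, t, scale),
                                 shape_context_feature(pts, t, scale)])
                 for t in range(len(pts))]
    return np.concatenate(per_point)

# 20 equally spaced points on a half circle stand in for a contour fragment.
angles = np.linspace(0.0, np.pi, 20)
fragment = np.stack([np.cos(angles), np.sin(angles)], axis=1)
descriptor = contour_descriptor(fragment)
print(descriptor.shape)  # (720,)
```

Each sample point contributes 24 triangle bins and 12 Shape Context bins, so the 20 points give the 720-dimensional vector stated in the text.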

2.2. And-Or Tree Representation 

Our model is defined with three types of nodes in three layers: one
root-node (i.e., the and-node), a number of or-nodes, and a number of
leaf-nodes, depicted by the square, dashed circles, and solid circles,
respectively, in Fig. 1. The root and-node represents the whole object,
and it has 6 children (or-nodes) in a layout of 2x3 blocks, each
representing one part of a shape. The number of leaf-nodes is not fixed
but is no more than $m$ for each or-node. Assume the maximum number of
nodes in the model is $1+n = 1+6+6 \times m$: 0 indexes the root-node,
$i = 1,\ldots,6$ indexes the or-nodes, and $j = 7,\ldots,n$ indexes the
leaf-nodes. We also let $j \in ch(i)$ index the child nodes of node
$i$. Note that we index $m$ leaf-nodes for each or-node even if some of
them do not exist; the parameters of non-existent leaf-nodes are set to
0. We define the three types of nodes as follows.

Leaf-node: Each leaf-node $L_j$, $j = 7,\ldots,n$, is a classifier of
local contour fragments corresponding to its parent or-node. All
leaf-nodes belonging to the same or-node (the localized block) have the
same location in the image. Suppose the location of the block is
$p_i = (p_i^x, p_i^y)$ and a contour fragment $c_j$ is selected as the
input of the classifier. We denote by $\phi^l(p_i, c_j)$ the feature
vector computed with the contour descriptor. Note that only the part of
$c_j$ inside the block is taken into account, as Fig. 2 illustrates. If
the contour $c_j$ is entirely outside the block, $\phi^l(p_i, c_j) = 0$.
The response of classifier $L_j$ for $c_j$ at location $p_i$ is
therefore defined as

$$R_{L_j}(p_i, c_j) = \omega_j^l \cdot \phi^l(p_i, c_j), \tag{1}$$

where $\omega_j^l$ is a parameter vector. We set $\omega_j^l = 0$ if the
leaf-node $L_j$ is empty.

Or-node: Each or-node $U_i$, $i = 1,\ldots,6$, is used to specify an
appropriate contour fragment from a set of detection candidates via its
children leaf-nodes $L_j$.

In order to encode shape deformation, the or-nodes are allowed to
perturb slightly with respect to the shape instance. For each or-node
$U_i$, we introduce an offset $d_i = (d_i^x, d_i^y)$ that describes its
expected spatial position relative to the position of the root-node
$p_0 = (p_0^x, p_0^y)$. Suppose the or-node block is located at
$p_i = (p_i^x, p_i^y)$; the difference between $p_i$ and the expected
position is $(dx, dy)$, where $dx = p_i^x - (p_0^x + d_i^x)$ and
$dy = p_i^y - (p_0^y + d_i^y)$. Therefore, given the or-node $U_i$, the
deformation cost of a leaf-node $L_j$ is defined as

$$Cost_{L_j}(p_0, p_i) = \omega_j^s \cdot \phi^s(p_0, p_i), \tag{2}$$

where $\phi^s(p_0, p_i) = (dx, dy, dx^2, dy^2)$ is the deformation
feature and $\omega_j^s$ is a 4-dimensional parameter vector for $L_j$.
We set $\omega_j^s = 0$ if $L_j$ is empty.
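As a small worked example of Equation (2), the sketch below computes the deformation feature and cost for one or-node block. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def deformation_feature(p0, p_i, d_i):
    """Deformation feature (dx, dy, dx^2, dy^2) of an or-node block.

    p0  : root-node position (x, y)
    p_i : actual position of the or-node block (x, y)
    d_i : expected offset of the block relative to the root-node
    """
    dx = p_i[0] - (p0[0] + d_i[0])
    dy = p_i[1] - (p0[1] + d_i[1])
    return np.array([dx, dy, dx ** 2, dy ** 2], dtype=float)

def deformation_cost(w_s, p0, p_i, d_i):
    """Dot product of the 4-dimensional parameter vector with the feature."""
    return float(np.dot(w_s, deformation_feature(p0, p_i, d_i)))

# Hypothetical numbers: the block sits 2 pixels right of and 1 pixel above
# its expected position.
feature = deformation_feature(p0=(10, 20), p_i=(17, 24), d_i=(5, 5))
cost = deformation_cost(np.array([0.0, 0.0, 0.1, 0.1]), (10, 20), (17, 24), (5, 5))
empty_cost = deformation_cost(np.zeros(4), (10, 20), (17, 24), (5, 5))
print(feature)  # [ 2. -1.  4.  1.]
```

An empty leaf-node has a zero parameter vector, so its cost is zero regardless of the displacement.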

Introducing the or-nodes into the tree structure has two advantages.
(1) The intraclass variance and the inconsistency caused by edge
computation can be captured by the different leaf-nodes specified by
the or-nodes. (2) The location flexibility of the or-nodes can deal
with non-rigid deformation or local displacement of shapes.

Root-node: The root-node at the top is a global classifier for the set
of contour fragments proposed by the or-nodes. The response of the
root-node is defined similarly to that of the local classifiers at the
leaf-nodes:

$$G(C^r) = \omega^r \cdot \phi^r(C^r), \tag{3}$$

where $\phi^r(C^r)$ is the feature vector of the contour set $C^r$ and
$\omega^r$ is the corresponding parameter vector.

2.3. Inference with And-Or Tree 

Given the learnt And-Or tree model, the inference task is to localize
the optimal contour fragments within the detection window. The target
shape (i.e., the root-node) is located by sliding the detection window
over all positions and scales of the edge map $X$. Assuming the
location of the root-node is $p_0 = (p_0^x, p_0^y)$, we describe the
inference as follows.

• Bottom-up local testing: For each leaf-node $L_j$, assume the block
of its parent $U_i$ is located at $p_i$. The detection score of $L_j$
is calculated by selecting the contour fragment with the highest
classifier response,

$$S_{L_j}(X, p_i) = \max_{c_j \in X} R_{L_j}(p_i, c_j) = \max_{c_j \in X} \omega_j^l \cdot \phi^l(p_i, c_j). \tag{4}$$

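The detection score of the or-node $U_i$ is calculated by specifying a
contour from the candidates localized by its children leaf-nodes. The
deformation cost for the block of $U_i$ is taken into account as well.
For a clear definition, we introduce an auxiliary "switch" vector
$\mathbf{v}_i = (v_j)_{j \in ch(i)}$, where $v_j \in \{0, 1\}$ and
$\|\mathbf{v}_i\| = 1$, to indicate which contour is chosen from the
$m$ candidates via $U_i$. The score of the or-node is then defined as

$$\begin{aligned}
S_{U_i}(X, p_0) &= \max_{\mathbf{v}_i, p_i} \sum_{j \in ch(i)} \big( S_{L_j}(X, p_i) \cdot v_j - Cost_{L_j}(p_0, p_i) \cdot v_j \big) \\
&= \max_{\mathbf{v}_i, p_i} \sum_{j \in ch(i)} \max_{c_j \in X} \big( \omega_j^l \cdot \phi^l(p_i, c_j) \cdot v_j - \omega_j^s \cdot \phi^s(p_0, p_i) \cdot v_j \big) \\
&= \max_{\mathbf{v}_i, p_i, \mathbf{c}_i} \sum_{j \in ch(i)} \big( \omega_j^l \cdot \phi^l(p_i, c_j) \cdot v_j - \omega_j^s \cdot \phi^s(p_0, p_i) \cdot v_j \big),
\end{aligned} \tag{5}$$

where $\mathbf{c}_i = (c_j)_{j \in ch(i)}$ is a vector representing the
input contours for each child leaf-node.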


Figure 3. Illustration of shape detection. The red boxes denote
bottom-up tests with the leaf-nodes and or-nodes, and the green box
denotes global verification via the root-node. The two detections (in
(a) and (b)) have similar bottom-up test scores (i.e., 0.561 : 0.364)
but different scores at the root-node (i.e., 0.093 : -0.458).

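• Verification via the root-node: We obtain a set of contours based on
the local proposals, $C^r = \{c_i\}$, where $c_i$ is the contour
activated by $U_i$. The verification is then achieved by calculating
the response of the root-node defined in Equation (3). As a result, the
overall inference score within the detection window is defined as

$$\begin{aligned}
S(X, p_0) &= \sum_{i=1}^{6} S_{U_i}(X, p_0) + G(C^r) \\
&= \sum_{i=1}^{6} \max_{\mathbf{v}_i, p_i, \mathbf{c}_i} \sum_{j \in ch(i)} \big( \omega_j^l \cdot \phi^l(p_i, c_j) \cdot v_j - \omega_j^s \cdot \phi^s(p_0, p_i) \cdot v_j \big) + \omega^r \cdot \phi^r(C^r) \\
&= \max_{V, P, C} \Big[ \sum_{i=1}^{6} \sum_{j \in ch(i)} \big( \omega_j^l \cdot \phi^l(p_i, c_j) \cdot v_j - \omega_j^s \cdot \phi^s(p_0, p_i) \cdot v_j \big) + \omega^r \cdot \phi^r(C^r) \Big],
\end{aligned} \tag{6}$$

where $V$ is the joint vector of all $\mathbf{v}_i$:
$V = (\mathbf{v}_1, \ldots, \mathbf{v}_6) = (v_7, \ldots, v_n)$, $C$ is
the joint vector of all $\mathbf{c}_i$:
$C = (\mathbf{c}_1, \ldots, \mathbf{c}_6) = (c_7, \ldots, c_n)$, and $P$
is the vector of or-node positions: $P = (p_1, \ldots, p_6)$. Defining
$H = (V, P, C)$, Equation (6) can be simplified as

$$S(X, p_0) = \max_{H} \omega \cdot \phi(X, H, p_0), \tag{7}$$

where $\omega$ is the vector of model parameters and $\phi(X, H, p_0)$
is the feature vector,

$$\omega = (\omega_7^l, \ldots, \omega_n^l, -\omega_7^s, \ldots, -\omega_n^s, \omega^r), \tag{8}$$

$$\phi(X, H, p_0) = \big( \phi^l(p_1, c_7) \cdot v_7, \ldots, \phi^l(p_6, c_n) \cdot v_n, \; \phi^s(p_0, p_1) \cdot v_7, \ldots, \phi^s(p_0, p_6) \cdot v_n, \; \phi^r(C^r) \big). \tag{9}$$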
We present an example in Fig. 3 to illustrate inference with the shape
model. The leaf-nodes are used to localize candidate contour fragments
and the or-nodes to specify the optimal ones; the red boxes denote the
results of the bottom-up tests. We then perform global verification via
the root-node, denoted by the green box. Its significance is
demonstrated by the false positive shown in Fig. 3(b): the aggregation
of local similarities needs to be verified globally.
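The two inference steps can be sketched as follows, assuming the leaf-node responses and deformation costs have already been evaluated for every candidate block position. The array layout, the function names, and the `root_response` callable are assumptions for illustration; the sketch only mirrors the structure of the or-node score (Equation 5) and the overall score (Equation 6).

```python
import numpy as np

def or_node_score(leaf_scores, deform_costs):
    """Pick the best (block position, leaf-node) pair for one or-node.

    leaf_scores, deform_costs: arrays of shape (num_positions, m), holding
    the best leaf response and the deformation cost at each candidate
    position. A one-hot switch vector keeps exactly one leaf-node.
    """
    net = np.asarray(leaf_scores, dtype=float) - np.asarray(deform_costs, dtype=float)
    pos, leaf = np.unravel_index(np.argmax(net), net.shape)
    return float(net[pos, leaf]), int(pos), int(leaf)

def detection_score(per_or_node, root_response):
    """Sum the or-node scores, then add the root-node verification term."""
    total, selections = 0.0, []
    for leaf_scores, deform_costs in per_or_node:
        score, pos, leaf = or_node_score(leaf_scores, deform_costs)
        total += score
        selections.append((pos, leaf))
    # root_response maps the selected fragments to the global response.
    return total + root_response(selections), selections

# Hypothetical toy input with two or-nodes instead of six.
toy = [
    (np.array([[0.2, 0.9], [0.4, 0.1]]), np.array([[0.0, 0.3], [0.1, 0.0]])),
    (np.array([[0.5], [0.7]]), np.array([[0.0], [0.4]])),
]
score, chosen = detection_score(toy, root_response=lambda sel: 0.5)
print(chosen)  # [(0, 1), (0, 0)]
```

In the second toy or-node the higher leaf response (0.7) loses to the lower one (0.5) because of its deformation cost, which is the trade-off the or-node score encodes.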

3. Discriminative Learning for And-Or Tree 

The learning of the And-Or tree model is an optimization problem that
integrates structure learning and parameter learning. The proposed
learning framework learns the structure and the parameters of the model
alternately, and is an extension of the original CCCP proposed in [24].
The significance of this algorithm is twofold. First, we can adjust the
layout of parts (decided by the or-nodes) to account for shape variants
within the data. Second, the leaf-nodes can be automatically merged and
created to fit the inferred latent variables. More specifically, two
leaf-nodes with similar discriminative ability (i.e., that localize
similar contours) are encouraged to merge into one node, and a new
leaf-node is encouraged to be created for detecting contours that
cannot be handled by the current model.









































Figure 4. Illustration of structure clustering during the learning
iterations. We visualize parts of the model at three intermediate
steps. Note that each part corresponds to an or-node in the model.
(a) The initial structure, i.e., the original regular layout. Two new
structures are generated as the latent variables change: (b) two
leaf-nodes belonging to part 2 are merged together; (c) a new leaf-node
is created and assigned to part 6.

3.1. Optimization Formulation 

Suppose we are given a set of positive and negative training samples
$(X_1, y_1), \ldots, (X_N, y_N)$, where $X$ is the edge map within the
detection window and $y = \pm 1$ is the label of $X$. We assume the
first $K$ samples, indexed from 1 to $K$, are positive. Letting $y = 1$
denote the positive samples and $y = -1$ the negative samples, we
define the feature vector for each sample $(X, y)$ as

$$\phi(X, y, H) = \begin{cases} \phi(X, H) & \text{if } y = +1, \\ 0 & \text{if } y = -1, \end{cases} \tag{10}$$

where $H$ denotes the latent variables for $X$, and $\phi(X, H)$ is
equivalent to $\phi(X, H, p_0)$ since the position of the root-node
$p_0$ is fixed. Thus Equation (7) can be rewritten as a discriminative
function,

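$$S_{\omega}(X) = \arg\max_{y, H} \big( \omega \cdot \phi(X, y, H) \big). \tag{11}$$

We can learn the discriminative function (i.e., Equation (11)) by
optimizing the following structural SVM target with latent variables:

$$\min_{\omega} \frac{1}{2} \|\omega\|^2 + D \sum_{k=1}^{N} \Big[ \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y, H) \big) - \max_{H} \big( \omega \cdot \phi(X_k, y_k, H) \big) \Big], \tag{12}$$

where $D$ is a fixed penalty parameter (set to 0.005 empirically) and
$\mathcal{L}(y_k, y, H)$ is the loss function. In our detection
problem, $\mathcal{L}(y_k, y, H) = 0$ if $y_k = y$, and
$\mathcal{L}(y_k, y, H) = 1$ otherwise.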

The optimization target defined in Equation (12) is non-convex. The
CCCP framework was recently proposed in [24, 25] to convert such a
target into a convex-plus-concave form and obtain a locally optimal
solution. Following this framework, we rewrite the target as

\begin{align}
\min_{\omega} &\Big[ \frac{1}{2} \|\omega\|^2 + D \sum_{k=1}^{N} \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y, H) \big) \Big] \tag{13} \\
- &\Big[ D \sum_{k=1}^{N} \max_{H} \big( \omega \cdot \phi(X_k, y_k, H) \big) \Big] \tag{14} \\
= \min_{\omega} &\big[ f(\omega) - g(\omega) \big], \tag{15}
\end{align}

where $f(\omega)$ represents the first two terms in (13), and
$g(\omega)$ represents the term in (14). However, the original CCCP
relies on the assumption that the tree structure is fixed during the
learning iterations, which is not suitable for our goal, since we need
to learn the And-Or structure simultaneously. We therefore propose an
extension of CCCP, namely dynamic CCCP (dCCCP), which embeds structural
clustering into the model parameter learning.

3.2. Optimization with dynamic CCCP 

In our learning algorithm, we allow the structure of the model to be
dynamically adjusted during each iteration of learning, as Fig. 4
illustrates. The proposed dCCCP framework iterates over the following
three steps.

Step 1. For optimization, we first need to construct a hyperplane that
upper bounds the concave part $-g(\omega)$ of the target function.
Given the parameter vector $\omega_t$ learned in the previous
iteration, we find the hyperplane $q_t$ such that

$$-g(\omega) \le -g(\omega_t) + (\omega - \omega_t) \cdot q_t, \quad \forall \omega. \tag{16}$$

This is done by searching for the best latent variables for each
training sample,
$H_k^* = \arg\max_{H} \big( \omega_t \cdot \phi(X_k, y_k, H) \big)$.
Note that $\phi(X_k, y_k, H) = 0$ when $y_k = -1$, so we only need to
estimate the latent variables for the positive training samples. The
hyperplane is then constructed as
$q_t = -D \sum_{k=1}^{N} \phi(X_k, y_k, H_k^*)$.

Step 2. Given $H_k^* = (V_k^*, P_k^*, C_k^*)$ for all positive samples,
the contour fragments can be localized in each positive sample $X_k$.
For each or-node $U_i$, we obtain a set of activated contour fragments
$\{c_i^1, c_i^2, \ldots, c_i^K\}$ from all positive samples
$\{X_1, \ldots, X_K\}$. Among this set, we first group the fragments
detected via the same leaf-node into the same cluster as a temporary
partition, and then apply the ISODATA algorithm to re-cluster these
contour fragments. Each contour fragment is described by its
contour-descriptor feature $\phi^l$, and the Euclidean distance is
adopted during clustering. The number of clusters is limited to $m$, in
keeping with the fixed size of the parameter vector $\omega$. After
clustering, each cluster represents a "potential" leaf-node whose
parameters will be trained in Step 3. In this step we need to decide
the new structure, and thus assign these potential leaf-nodes to parent
or-nodes.

The latent variables of each positive sample also change, from $H_k^*$
to $H_k^d$, along with the structure adjustment. Suppose the new
hyperplane is $q_t^d = -D \sum_{k=1}^{N} \phi(X_k, y_k, H_k^d)$. To
maintain the property in Equation (16), we constrain the newly
generated $q_t^d$ by $\|q_t^d - q_t\| < \delta$ during the clustering
procedure, where $\delta$ is set manually. Intuitively, we check this
constraint at each step of splitting or merging clusters, which keeps
the structure adjustment at an appropriate level.
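The constraint on the regenerated hyperplane can be sketched as a simple acceptance test for a proposed split or merge. The function names, the toy feature vectors, and the two threshold values are hypothetical; only the form of the hyperplane and the norm bound come from the text.

```python
import numpy as np

def hyperplane(latent_features, D):
    """Hyperplane built from the features of the imputed latent variables,
    i.e. minus D times their sum over the training samples."""
    return -D * np.sum(np.asarray(latent_features, dtype=float), axis=0)

def structure_change_allowed(features_before, features_after, D, delta):
    """Accept a split/merge only if the hyperplane moves by less than delta."""
    q_old = hyperplane(features_before, D)
    q_new = hyperplane(features_after, D)
    return bool(np.linalg.norm(q_new - q_old) < delta)

# Toy features of two positive samples before and after a proposed merge.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [0.0, 2.0]]
accepted = structure_change_allowed(before, after, D=0.005, delta=0.01)
rejected = structure_change_allowed(before, after, D=0.005, delta=0.001)
```

With `D=0.005` the hyperplane moves by 0.005 in this toy case, so the change passes the looser threshold and fails the tighter one.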

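Step 3. Given the latent variables and the newly generated structure,
the parameters of the model are learned by solving the optimization
problem $\omega_{t+1} = \arg\min_{\omega} [f(\omega) + \omega \cdot q_t^d]$.
By substituting $f(\omega)$ with the first two terms in Equation (13),
it can be written as

$$\min_{\omega} \frac{1}{2} \|\omega\|^2 + D \sum_{k=1}^{N} \Big[ \max_{y, H} \big( \omega \cdot \phi(X_k, y, H) + \mathcal{L}(y_k, y, H) \big) - \omega \cdot \phi(X_k, y_k, H_k^d) \Big]. \tag{17}$$

This is a standard structural SVM problem. Letting
$\Delta\phi(X_k, y, H) = \phi(X_k, y_k, H_k^d) - \phi(X_k, y, H)$, the
solution can be expressed as

$$\omega^* = D \sum_{k, y, H} \alpha^*_{k, y, H} \, \Delta\phi(X_k, y, H), \tag{18}$$

where $\alpha^*$ is obtained by maximizing the dual function

$$\max_{\alpha} \sum_{k, y, H} \alpha_{k, y, H} \, \mathcal{L}(y_k, y, H) - \frac{D}{2} \sum_{k, k'} \sum_{y, H, y', H'} \alpha_{k, y, H} \, \alpha_{k', y', H'} \, \Delta\phi(X_k, y, H) \cdot \Delta\phi(X_{k'}, y', H'), \tag{19}$$

which is the dual problem of a standard SVM. We solve this problem by
applying the cutting plane method [ ] and Sequential Minimal
Optimization [18]. Once the parameters $\omega_{t+1}$ are obtained, we
repeat the 3-step iteration until the function defined in
Equation (15) converges.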
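To make the alternation concrete, here is a toy sketch of the CCCP loop for a binary latent SVM. It is not the authors' solver: Step 2 (structural re-clustering) is omitted, the convex subproblem of Step 3 is minimized with plain subgradient descent instead of the cutting plane method and SMO, and the data, the function names, the step size, and the choice D=1.0 are all assumptions for illustration.

```python
import numpy as np

def objective(w, X, y, D):
    """Latent structural SVM target for binary labels, with zero features
    for the negative label and 0/1 loss.

    X[k] holds one candidate feature vector per latent assignment H.
    """
    total = 0.5 * float(w @ w)
    for Xk, yk in zip(X, y):
        best = float(np.max(Xk @ w))                # max over H of w . phi
        loss_pos = 0.0 if yk == 1 else 1.0          # loss of predicting +1
        loss_neg = 0.0 if yk == -1 else 1.0         # loss of predicting -1
        augmented = max(best + loss_pos, loss_neg)  # loss-augmented max over (y, H)
        truth = best if yk == 1 else 0.0            # score of the true label
        total += D * (augmented - truth)
    return total

def cccp(X, y, D=1.0, outer=10, inner=200, lr=0.01):
    w = np.zeros(X[0].shape[1])
    for _ in range(outer):
        # Step 1: impute the latent variables of the positives and build
        # the hyperplane q that linearizes the concave part.
        q = np.zeros_like(w)
        for Xk, yk in zip(X, y):
            if yk == 1:
                q -= D * Xk[np.argmax(Xk @ w)]
        # Step 2 (structural re-clustering) is omitted in this toy version.
        # Step 3: minimize f(w) + w . q by subgradient descent (assumption).
        for _ in range(inner):
            grad = w + q
            for Xk, yk in zip(X, y):
                h = int(np.argmax(Xk @ w))
                loss_pos = 0.0 if yk == 1 else 1.0
                loss_neg = 0.0 if yk == -1 else 1.0
                if Xk[h] @ w + loss_pos > loss_neg:
                    grad = grad + D * Xk[h]
            w = w - lr * grad
    return w

# Toy data: each sample offers two candidate latent assignments in 2-D.
X = [np.array([[2.0, 0.0], [-1.0, 1.0]])] * 3 \
    + [np.array([[-2.0, 0.0], [-1.0, -1.0]])] * 3
y = [1, 1, 1, -1, -1, -1]
w = cccp(X, y)
```

On this toy problem the learned `w` lowers the objective relative to the zero vector and scores every positive sample above every negative one; the loop structure, not the solver, is the point of the sketch.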

3.3. Initialization 

At the beginning of learning, the block of each or-node is set by
regular decomposition, i.e., $(dx, dy) = (0, 0)$. Since no leaf-node
exists at the beginning, we select, from the positive samples, the
contours with the largest length for each or-node. The structure of our
shape model is initialized by clustering without any constraints, and
the initial latent variables are obtained accordingly. The parameters
of the initialized model can then be calculated by solving the standard
structural SVM problem.

Algorithm 1 summarizes the overall procedure for learning the shape
model with the And-Or tree.

4. Experiments 

We apply the proposed method on shape detection from 
images, using the ETHZ database [8] and the INRIA-Horse 
database [9] for validation. 


Algorithm 1 Learning the shape model of the And-Or tree.
Input:
    Positive and negative training samples,
    $\{X_k, y_k\}^+ \cup \{X_{k'}, y_{k'}\}^-$, $k = 1..K$, $k' = K+1..N$.
Output:
    The structure and parameters of the shape model.
Initialization:
    1. Initialize the structure of the model and the latent variables.
    2. Initialize the parameters of the model.
repeat
    1. Estimate the latent variables $H_k^*$ by applying inference on each
       positive sample $(X_k, y_k)$ with the current model.
    2. (a) Localize the contour fragments for each sample $(X_k, y_k)$
           using the current latent variables $H_k^*$.
       (b) For each or-node $U_i$, apply the clustering algorithm with
           constraints on the contours $\{c_i^1, c_i^2, \ldots, c_i^K\}$
           localized in the same block.
       (c) Explore a new structure by re-assigning leaf-nodes to or-nodes
           and modifying the latent variables for each sample from
           $H_k^*$ to $H_k^d$.
    3. Estimate the model parameters $\omega$ with the fixed model
       structure and latent variables $H_k^d$.
until the target function defined in Equation (15) converges.



Figure 5. The precision-recall curves on the ETHZ database. The 
black (bold) curves represent the results of our method, and the 
other curves are reported from the previous works. 


Method                  Applelogos  Bottles  Giraffes  Mugs   Swans  Average
Our method              0.909       0.898    0.811     0.893  0.964  0.895
Ma et al. [15]          0.881       0.920    0.756     0.868  0.959  0.877
Srinivasan et al. [21]  0.845       0.916    0.787     0.888  0.922  0.872
Maji et al. [17]        0.869       0.724    0.742     0.806  0.716  0.771
Felz et al. [6]         0.891       0.950    0.608     0.721  0.391  0.712
Lu et al. [13]          0.844       0.641    0.617     0.643  0.798  0.709

Table 1. Quantitative results and comparisons with average precision (AP) on the ETHZ database.

Experiment setting. We fix the number of or-nodes in the shape model to
6, and the initial layout is a regular partition (e.g., 2x3 blocks).
The maximum number of leaf-nodes for each or-node is set to 3. The
shape model training is performed in a semi-supervised manner: the
clutter-free contours of positive shapes are labeled, and the
structures of the models are determined automatically. We extract edge
maps for negative examples using the Pb edge detector [16] with an
edge-linking algorithm. We adopt the PASCAL Challenge criterion as the
testing standard: a detection is counted as valid only if its
intersection-over-union ratio (IoU) with the ground-truth bounding box
is greater than 50%; otherwise it is counted as a false positive. All
shape detection results generated by our method are included in the
supplemental material.
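The PASCAL criterion stated above reduces to a short overlap test. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_valid_detection(detection, ground_truth, threshold=0.5):
    """A detection counts as valid only if its IoU exceeds the threshold."""
    return iou(detection, ground_truth) > threshold

# A box shifted by half its width overlaps the ground truth with IoU 1/3,
# so it is counted as a false positive.
shifted_ok = is_valid_detection((5, 0, 15, 10), (0, 0, 10, 10))
exact_ok = is_valid_detection((0, 0, 10, 10), (0, 0, 10, 10))
```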

The learning algorithm converges after 5 to 7 iterations. During
detection, images are searched at 6 different scales, 2 per octave. We
carry out the experiments on a PC with a Core Duo 3.0 GHz CPU and 16 GB
of memory. On average, training a shape model takes 4 to 8 hours,
depending on the numbers of training/testing examples; detection on one
image takes around 2 to 3 minutes.
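For reference, "6 scales, 2 per octave" corresponds to scale factors that grow by a factor of the square root of 2 per step. The sketch below lists them under the assumption that the search starts from a base scale of 1.0 and moves upward; the text does not state the base scale or the direction, and the function name is hypothetical.

```python
def search_scales(num_scales=6, scales_per_octave=2, base=1.0):
    """Scale factors spaced so that every `scales_per_octave` steps double."""
    return [base * 2.0 ** (k / scales_per_octave) for k in range(num_scales)]

scales = search_scales()
```

Under these assumptions the six factors span two and a half octaves, from 1.0 up to about 5.66.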

Experiment I. We use all five shape classes from the ETHZ database
(i.e., Apples, Bottles, Giraffes, Mugs and Swans). There are 32 to 87
images in each class, and each image includes at least one shape
instance. In the experiments, half of the images of each category are
randomly selected as positive examples, and a comparable number of
negative examples (70 to 90) are extracted from the remaining
categories or backgrounds. The trained shape model for each category is
tested on the remaining images. A few typical experimental results are
shown in Fig. 6(a). For quantitative evaluation, we adopt the
precision-recall (PR) curves and the average precision (AP) as
benchmark metrics, and compare with the state-of-the-art
methods [17, 21, 6, 13, 15]. The quantitative results are reported in
Fig. 5 and Table 1. Our method achieves the highest AP on 4 categories
(i.e., Apples, Giraffes, Mugs and Swans), which have relatively large
intraclass variance or complex backgrounds.

Experiment II. The INRIA-Horse dataset consists of 170 images with one
or more horses and 170 images without horses, and is more challenging
than the ETHZ database. Horses appear in the images at several scales,
with occlusions and against cluttered backgrounds. We randomly select
50 positive examples and 80 negative examples for training, and test on
the remaining images. Fig. 7 reports the recall against the number of
false positives per image (FPPI), averaged over all 210 test images.
Compared with recently proposed methods, our system performs
substantially better: we achieve a detection rate of 91.2% at 1.0 FPPI,
whereas the reported results of the competing algorithms are 87.3%
in [23], 85.27% in [17], 80.77% in [8], and 73.75% in [7]. Some of the
shape detection results are shown in Fig. 6(b); the improvements mainly
come from accurate localization in the presence of (i) inconsistent
shape contours (caused by pose variations or occlusions) and (ii) noisy
edge maps.

Figure 6. A few representative shape detection results generated by our
method. Two false positives in (a) are labeled by the bold blue frames.

Figure 7. Experimental results with the recall-FPPI measurement on the
INRIA-Horse database.
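The recall-FPPI measurement can be sketched as a sweep over the detection score threshold. The input format (a list of `(score, is_true_positive)` pairs pooled over all test images), the function name, and the toy numbers are assumptions for illustration.

```python
def recall_at_fppi(detections, num_positives, num_images, fppi=1.0):
    """Best recall reachable while keeping false positives per image <= fppi.

    detections: list of (score, is_true_positive) over all test images.
    """
    true_pos = false_pos = 0
    best_recall = 0.0
    # Lower the score threshold one detection at a time.
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / num_images <= fppi:
            best_recall = max(best_recall, true_pos / num_positives)
    return best_recall

# Toy example: 2 images, 4 ground-truth objects, 6 scored detections.
toy = [(0.9, True), (0.8, False), (0.7, True),
       (0.6, False), (0.5, False), (0.4, True)]
print(recall_at_fppi(toy, num_positives=4, num_images=2, fppi=1.0))  # 0.5
```

In the toy example the third true positive only appears after a third false positive, which exceeds 1.0 FPPI on two images, so the recall at 1.0 FPPI stays at 0.5.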

5. Summary 

This paper studies a novel contour-fragment-based shape model with the
And-Or tree representation. The model extends traditional hierarchical
tree structures by introducing or-nodes that explicitly specify
production rules to capture shape variations. Our approach achieves
state-of-the-art shape detection performance on the ETHZ and
INRIA-Horse databases. Moreover, the And-Or tree learning algorithm is
general and can be applied to other vision tasks.